3-D Hand Pose Estimation from Kinect's Point Cloud Using Appearance Matching
We present a novel appearance-based approach for pose estimation of a human
hand using the point clouds provided by the low-cost Microsoft Kinect sensor.
Both the free-hand case, in which the hand is isolated from the surrounding
environment, and the hand-object case, in which different types of
interactions are classified, are considered. The hand-object case is
clearly the more challenging of the two, as it has to deal with multiple tracks. The
approach proposed here belongs to the class of partial pose estimation where
the estimated pose in a frame is used for the initialization of the next one.
The pose estimation is obtained by applying a modified version of the Iterative
Closest Point (ICP) algorithm to synthetic models, obtaining the rigid
transformation that aligns each model with the input data. The
proposed framework uses a "pure" point cloud as provided by the Kinect sensor
without any other information such as RGB values or normal vector components.
For this reason, the proposed method can also be applied to data obtained from
other types of depth sensors or RGB-D cameras.
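The alignment step described above can be illustrated with a minimal ICP sketch in NumPy. This is a simplified illustration, not the paper's modified ICP variant: the brute-force nearest-neighbor search, the iteration budget, and the helper names are assumptions made for clarity.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # cross-covariance of centered clouds
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(model, scene, iters=20):
    """Iteratively align a synthetic model cloud to the observed scene cloud."""
    cur = model.copy()
    for _ in range(iters):
        # closest-point correspondences (brute force, O(n^2), for clarity only)
        d = np.linalg.norm(cur[:, None, :] - scene[None, :, :], axis=2)
        matched = scene[d.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t                      # row-vector form of R p + t
    return cur
```

In practice, a real pipeline would replace the brute-force search with a k-d tree and add outlier rejection, but the fixed-point structure (match, solve, transform, repeat) is the same.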
A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata
Images represent a commonly used form of visual communication among people.
Nevertheless, image classification may be a challenging task when dealing with
unclear or uncommon images that need more context to be correctly annotated.
Metadata accompanying images on social media represent an ideal source of
additional information for retrieving suitable neighborhoods, easing the
image annotation task. To this end, we blend visual features extracted from neighbors
and their metadata to jointly leverage context and visual cues. Our models use
multiple semantic embeddings to achieve the dual objective of being robust to
vocabulary changes between train and test sets and decoupling the architecture
from the low-level metadata representation. Convolutional and recurrent neural
networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors
and query images. We perform comprehensive experiments on the NUS-WIDE dataset
showing that our models outperform state-of-the-art architectures based on
images and metadata, and decrease both sensory and semantic gaps to better
annotate images.
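The neighborhood-retrieval idea can be sketched as a blend of visual and metadata similarity scores. This is a hypothetical simplification, not the paper's CNN-RNN architecture: the cosine-similarity scoring and the `alpha` blending weight are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Unit-normalize vectors so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_neighbors(query_vis, query_meta, db_vis, db_meta, k=3, alpha=0.5):
    """Rank database images by a weighted blend of visual and metadata similarity.
    alpha trades visual cues against metadata context (illustrative knob)."""
    q_v, q_m = l2_normalize(query_vis), l2_normalize(query_meta)
    d_v, d_m = l2_normalize(db_vis), l2_normalize(db_meta)
    score = alpha * (d_v @ q_v) + (1 - alpha) * (d_m @ q_m)
    return np.argsort(-score)[:k]                # indices of the top-k neighbors
```

The retrieved neighbors' labels and features would then be fed, together with the query, to the joint model.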
Social and Scene-Aware Trajectory Prediction in Crowded Spaces
Mimicking the human ability to forecast future positions or interpret complex
interactions in urban scenarios, such as streets, shopping malls or squares, is
essential to develop socially compliant robots or self-driving cars. Autonomous
systems may gain an advantage by anticipating human motion to avoid collisions or
to naturally behave alongside people. To foresee plausible trajectories, we
construct an LSTM (long short-term memory)-based model considering three
fundamental factors: people interactions, past observations in terms of
previously crossed areas and semantics of surrounding space. Our model
encompasses several pooling mechanisms to join the above elements defining
multiple tensors, namely social, navigation and semantic tensors. The network
is tested in unstructured environments where complex paths emerge according to
both internal (intentions) and external (other people, not accessible areas)
motivations. As demonstrated, modeling paths without accounting for social
interactions or contextual information is insufficient to correctly predict
future positions.
Experimental results corroborate the effectiveness of the proposed framework in
comparison to LSTM-based models for human path prediction.
Comment: Accepted to the ICCV 2019 Workshop on Assistive Computer Vision and
Robotics (ACVR).
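The social tensor mentioned above can be sketched as grid pooling of neighbors' hidden states around each agent. This is a simplified illustration of one pooling mechanism, in the spirit of Social-LSTM-style pooling; the grid size, radius, and sum aggregation are assumptions, not the paper's exact formulation.

```python
import numpy as np

def social_tensor(positions, hidden, grid=4, radius=2.0):
    """Pool neighbours' hidden states into a (grid x grid) map around each agent.
    positions: (N, 2) current coordinates; hidden: (N, H) per-agent LSTM states."""
    n, h = hidden.shape
    out = np.zeros((n, grid, grid, h))
    cell = 2 * radius / grid                     # side length of one grid cell
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx, dy = positions[j] - positions[i]
            gx, gy = int((dx + radius) // cell), int((dy + radius) // cell)
            if 0 <= gx < grid and 0 <= gy < grid:
                out[i, gx, gy] += hidden[j]      # sum states of neighbours in cell
    return out
```

Analogous tensors built from previously crossed areas (navigation) and scene segmentation (semantics) would be concatenated with this one before the LSTM update.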
Knowledge Distillation for Action Anticipation via Label Smoothing
The human capability to anticipate the near future from visual observations and
non-verbal cues is essential for developing intelligent systems that need to
interact with people. Several research areas, such as human-robot interaction
(HRI), assisted living or autonomous driving need to foresee future events to
avoid crashes or help people. Egocentric scenarios are classic examples where
action anticipation is applied due to their numerous applications. Such a
challenging task demands capturing and modeling the domain's hidden structure to
reduce prediction uncertainty. Since multiple actions may equally occur in the
future, we treat action anticipation as a multi-label problem with missing
labels, extending the concept of label smoothing. This idea resembles the
knowledge distillation process since useful information is injected into the
model during training. We implement a multi-modal framework based on long
short-term memory (LSTM) networks to summarize past observations and make
predictions at different time steps. We perform extensive experiments on
EPIC-Kitchens and EGTEA Gaze+ datasets including more than 2500 and 100 action
classes, respectively. The experiments show that label smoothing systematically
improves the performance of state-of-the-art models for action anticipation.
Comment: Accepted to ICPR 202
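The label-smoothing idea can be sketched in its standard single-label form; the paper extends it to a multi-label setting with missing labels, so the code below is a simplified baseline, and the `eps` value is an illustrative assumption.

```python
import numpy as np

def smooth_targets(one_hot, eps=0.1):
    """Uniform label smoothing: keep mass 1 - eps on the ground truth and
    spread eps uniformly over all k classes (single-label version)."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft targets, as used when training with
    smoothed labels; log-softmax is computed in a numerically stable way."""
    logp = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return -(soft_targets * logp).sum()
```

Training against such soft targets injects a prior over plausible futures into the model, which is what makes the procedure resemble knowledge distillation.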
Point-based Path Prediction from Polar Histograms
We address the problem of modeling complex target behavior using a stochastic model that integrates object dynamics, statistics gathered from the environment, and semantic knowledge about the scene. The method exploits prior knowledge to build point-wise polar histograms that steer the forecast of target motion toward the most likely paths. Physical constraints are included in the model through a ray-launching procedure, while semantic scene segmentation provides a coarser representation of the most likely crossable areas. The model is enhanced with statistics extracted from previously observed trajectories and with nearly-constant-velocity dynamics. Information regarding the target's destination may also be included, steering the prediction toward a predetermined area. Our experimental results, validated against actual targets' trajectories, demonstrate that our approach can be effective in forecasting objects' behavior in structured scenes.
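A point-wise polar histogram of this kind can be sketched by binning observed motion directions at a scene location; the bin count and the uniform angular binning are illustrative assumptions, and the ray-launching and semantic terms of the full model are omitted.

```python
import numpy as np

def polar_histogram(displacements, n_bins=8):
    """Accumulate observed motion directions at a scene point into angular bins;
    the normalized histogram acts as a prior over where targets move next.
    displacements: (N, 2) step vectors observed at (or near) that point."""
    angles = np.arctan2(displacements[:, 1], displacements[:, 0])   # in (-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, minlength=n_bins).astype(float)
    return hist / hist.sum()                    # probability over directions
```

Multiplying such priors along candidate rays, together with the dynamics term, is one natural way to score the most likely paths.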
How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting
Accurate prediction of future human positions is an essential task for modern
video-surveillance systems. Current state-of-the-art models usually rely on a
"history" of past tracked locations (e.g., 3 to 5 seconds) to predict a
plausible sequence of future locations (e.g., up to the next 5 seconds). We
feel that this common schema neglects critical traits of realistic
applications: as the collection of input trajectories involves machine
perception (i.e., detection and tracking), incorrect detection and
fragmentation errors may accumulate in crowded scenes, leading to tracking
drifts. On this account, the model would be fed with corrupted and noisy input
data, thus severely degrading its prediction performance.
In this regard, we focus on delivering accurate predictions when only a few
input observations are used, thus potentially lowering the risks associated
with automatic perception. To this end, we conceive a novel distillation
strategy that allows knowledge transfer from a teacher network to a student
one, the latter fed with fewer observations (just two). We show that a
properly defined teacher supervision allows a student network to perform
comparably to state-of-the-art approaches that demand more observations.
Besides, extensive experiments on common trajectory forecasting datasets
highlight that our student network better generalizes to unseen scenarios.
Comment: Accepted by CVPR 202
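The teacher-student supervision described above can be sketched as a combined loss: the student (fed only two observations) is penalized both for missing the ground truth and for diverging from the teacher's predictions made on the full history. The MSE terms and the `beta` trade-off weight are hypothetical simplifications, not the paper's exact formulation.

```python
import numpy as np

def distillation_loss(student_pred, teacher_pred, ground_truth, beta=0.5):
    """Weighted sum of the task error (vs. ground truth) and the
    teacher-matching error (vs. the teacher's forecast).
    All inputs: (T, 2) arrays of future positions."""
    task = np.mean((student_pred - ground_truth) ** 2)     # supervised term
    distill = np.mean((student_pred - teacher_pred) ** 2)  # imitation term
    return (1 - beta) * task + beta * distill
```

At test time only the student runs, so the shorter observation window (and its lower exposure to tracking drift) comes for free.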